-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Batch vote metrics (BFT-486) #157
Conversation
928cf60
to
8a44e6d
Compare
pub(crate) last_added_vote_batch_number: vise::Gauge<u64>, | ||
|
||
/// Batch number of the last batch signed by this attester. | ||
pub(crate) last_signed_batch_number: vise::Gauge<u64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can postpone this, but can we have a histogram
that would show the distribution of votes delay after an l1 batch was produced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds interesting, I will create a ticket for this. There are a few challenges:
- the timestamp of the batch isn't available here
- currently we can receive votes earlier than we have the batch itself, so we would have to store a timestamp with the vote, and the retrospectively emit negative latencies when the batch is produced on the local node
We can have some specific workarounds:
- For example when the
last_signed_batch_number
above changes it can act as an approximation of when the L1 batch was produced, but it's only an approximation because it depends on the polling frequency, and also it's only available on nodes which are attesters. - We could observe the time elapsed between the first and last vote. For attester nodes this would indicate the speed of propagation of votes over gossip, compared to the vote produced locally, although again the local vote might come later than the first gossiped one.
- We could implement Grzegorz's idea about only gossiping votes to nodes which have indicated they have the batch already, which would deal with the negative values; this simplifies the metrics in that we don't have to buffer timestamps and observe them the first time we learn about the batch.
- We could observe the difference between
l1_batches.created_at
andl1_batches_consensus.created_at
when we insert the QC, which would be the overall latency of gossiping votes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the following metrics to the Consensus Dashboard on Grafana: The queries behind them are the following:
|
What ❔
Adds metrics around the gossiping of batch votes and the storage of batches.
Added
gossip::metrics::BATCH_VOTES_METRICS
:network_gossip_batch_votes_committee_size
to show how many attesters we have, which can be correlated with the number of votes we receive over time. Wanted to add a weight based one as well, but not sure if counting the weights monotonically would not overflow the counter.network_gossip_batch_votes_votes_added
counts the votes received since start; with its rate-of-change we can get an idea of how many attesters are producing votesnetwork_gossip_batch_votes_weight_added
counts the total weight of votes added since start, each normalized to [0; 1] range with the total committee weight; this way it shouldn't overflow, and its rate of growth should be 0.8-1.0 per batch.network_gossip_batch_votes_min_batch_number
shows the minimum batch we expect votes for; going up will indicate that QCs are formednetwork_gossip_batch_votes_last_added_vote_batch_number
is the number from the last vote we registered; apart from Byzantine votes it should average to go up as L1 batches are created and votes are received, even if a QC cannot be formednetwork_gossip_batch_votes_last_signed_batch_number
indicates that the current node is publishing its votes and keeping up with the L1 batchesAlso implement metrics for the
BatchStore
similar to how theBlockStore
has them:next
forqueued
andpersisted
PersistedBatchStore
operationswait_for_...
methodszksync_consensus_storage_batch_store_last_persisted_batch_qc
: last batch QC savedThe latencies for some of the
BatchStore
methods could potentially be less cumbersome if we used instrumentation.Why ❔
To make the voting process observable.